A Video-Driven Cloning Network for Pose-to-Video Translation
Our main goal is to learn to translate temporal sequences of 2D human
poses into video frames depicting a person who closely resembles the
target actor in a given reference video. By feeding such a translator
human poses extracted from different driving videos, it may, in
principle, be used to clone arbitrary video performances of other
people.
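Concretely, one way to realize such a translator is as an image-to-image network operating on a rasterized pose representation. The following is a minimal sketch, not the architecture used here: it assumes each pose is encoded as a stack of per-joint heatmaps, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PoseToFrameTranslator(nn.Module):
    """Maps a J-channel pose heatmap to a 3-channel RGB frame.
    Illustrative encoder-decoder; not the paper's actual network."""
    def __init__(self, num_joints: int = 18, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # Encoder: downsample the rasterized pose.
            nn.Conv2d(num_joints, base, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            # Decoder: upsample back to image resolution.
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1),
            nn.Tanh(),  # RGB output in [-1, 1]
        )

    def forward(self, pose_heatmaps: torch.Tensor) -> torch.Tensor:
        return self.net(pose_heatmaps)

# Usage: one 256x256 frame synthesized from one 18-joint pose.
frame = PoseToFrameTranslator()(torch.randn(1, 18, 256, 256))
```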
Measures of Pose Similarity
As discussed above, the visual quality of the frames produced by our
network depends heavily on how closely the poses in the driving video
resemble those in the paired training data extracted from the
reference video. However, a driving pose need not exactly match any
single pose in the training set: because our network uses a PatchGAN
discriminator, it can synthesize a frame by aggregating local patches
observed during training.
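To make "aggregating local patches" concrete, below is a hedged sketch of a standard PatchGAN-style discriminator in the spirit of pix2pix, not necessarily the exact network used in this work. Because it is fully convolutional, each cell of its output scores only one local receptive field of the input, which is what lets the generator reuse local appearance patches seen during training.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: outputs a grid of
    real/fake scores, one per overlapping input patch."""
    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),  # score map
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Each output cell scores one local patch of the 256x256 input.
scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 31, 31])
```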
It is therefore important to quantify the similarity between driving
and reference poses in a manner that is both local and
translation-invariant. In this section, we propose pose metrics that
attempt to measure similarity in such a manner.
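As an illustration of what a local, translation-invariant comparison might look like, the sketch below compares two poses through their bone vectors (offsets between connected joints), which are unaffected by global position; the skeleton connectivity and joint count here are assumptions for demonstration, not the metrics proposed in this section.

```python
import numpy as np

# Hypothetical skeleton: pairs of connected joint indices.
BONES = [(0, 1), (1, 2), (2, 3),
         (1, 4), (4, 5), (1, 6),
         (6, 7), (6, 8), (8, 9)]

def bone_vectors(pose: np.ndarray) -> np.ndarray:
    """pose: (J, 2) array of 2D joint positions -> (B, 2) bone offsets.
    Offsets between joints are unchanged by global translation."""
    return np.stack([pose[b] - pose[a] for a, b in BONES])

def pose_similarity(driving: np.ndarray, reference: np.ndarray) -> float:
    """Negative mean per-bone distance: higher means more similar.
    Each bone contributes locally, so a mismatch in one limb does not
    dominate the score for the rest of the body."""
    d = np.linalg.norm(bone_vectors(driving) - bone_vectors(reference),
                       axis=1)
    return -float(d.mean())

# Identical poses shifted by a constant offset score maximally,
# demonstrating translation invariance.
p = np.random.rand(10, 2)
print(pose_similarity(p, p + 5.0))  # -> -0.0
```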